q21ubfandomcom-20200214-history
How to Capture the Zeitgeist via Eigen-tweets?
A Short Answer Arguably, the aforementioned enormous daily load of tweets largely encapsulates the worldwide emerging trends and breaking news. Therefore, in the past few years, tweet trend-clustering and geo-temporal trend tracking (a.k.a. zeitgeist tracking) has attracted considerable interest, with a plethora of online applications B. Parr (2009, Apr. 29). "Swine flu's tweet tweet causes online flutter," ''Business Standard'' [Online]. Available: http://mashabble.com/2009/04/04/twitter-trends performing data (tweet message) and metadata (tweet author, location, number of re-tweets, etc.) search queries on the open tweet database. Modern popular techniques are either (i) singling out the most popular tweets or (ii) delivering in single words (or hashtags) the pivotal points of emerging trends and breaking news. Although existing approaches seem to largely capture the ``spirit of the time", they suffer from some major drawbacks. On the one hand, the ``popular tweets" approach fails to accomplish ''complete'' trend identification, since the most popular tweets may span only a small subset of the emerging trends, while, on the other hand, single-word trend titling lacks in concept ''descriptiveness''. Following motivational Example 1 highlights these drawbacks. Example 1 In Table I, lies a pool of 8 tweets, available for processing. We consider these tweets to have been posted simultaneously in [http://en.wikipedia.org/wiki/Greece Greece] and to be, other than geographically, metadata-wise independent. In the second column, as a measure of popularity, we append the aggregate number of re-tweets, favors, and replies. Say that we are interested in extracting 3 objects (let them be words, phrases or tweets) that encapsulate the major emerging trends and/or breaking news and are homogeneous; i.e., an object cannot contain information for more than one trend or piece of news. Certainly, the first popular tweets method would return items 1, 4, and 8, capturing only tweets referring to the recent financial developments in [https://en.wikipedia.org/wiki/Cyprus Cyprus] . On the other hand, the single-word trend titling, in its simplest variant, would return the words ``Greek", ``Cyprus", and ``bailout", failing to offer a comprehensive depiction of the trends. State-of-the-art research on large-scale [http://en.wikipedia.org/wiki/Sparse_PCA sparse principal component analysis (PCA)] D. S. Papailiopoulos, A. G. Dimakis, S. Korokythakis, "Sparse PCA through low-rank approximations," ''arXiv:1303.055lvl [stat.ML]'', Mar. 2013. introduces the concept of Eigen-tweet (ET) basis. An ET is a tweet-like collection of key-words put together to comprehensively describe a piece of news or a trend. Respectively, ET basis is a set of ET's that are content-wise orthogonal to each other and span/capture the most important part of the zeitgeist similarly to the way the eigen-basis spans the range of a matrix. Although based on a simple concept, the Eigen-tweet basis extraction can be quite a challenging task (or even intractable) in terms of complexity, considering that meaningful results can only be derived operating on enormous tweet data sets. Interestingly, authors in , M. Asteris, D. S. Papailiopoulos, and G. N. Karystinos, ``Sparse principal component of a rank-deﬁcient matrix," in ''Proc. IEEE Intern. Symp. Inf. Theory'', St. Petersburg, Russia, Aug. 2011, pp. 673-677. have delivered novel techniques for efficient large scale sparse PCA, that, applied on huge tweet data set can deliver its principal components practically fast. In the sequel, we formally introduce the Eigen-tweets concept. A Long Answer Let a collection of N tweet messages t_1, t_2, \cdots comprise the available tweet data-set \mathcal T=\{ t_1, t_2, \cdots, t_N\} \ \ \ \ \ \ \ \ \ \ (1) with cardinality |\mathcal T|=N . Following the bag-of-words model W. Cohen, ``Bag of words and word positions," in ''Advances in inductive logic programming'', Amsterdam, NL: IOS Press, 1995, pp. 124--143., we build our dictionary \mathcal D=\{w_1, w_2, \cdots, w_M\} \ \ \ \ \ \ \ \ \ \ (2) as a selection of M non-trivial (pronouns, proverbs, pro-adverbs, pro-adjectives, etc. are considered to be trivial) preferably grammatically uncorrelated words appearing in \mathcal T . Certainly, for the problem at hand, we operate in the intersection of high-dimensionality and big-data regimes and N and M are large numbers, at maybe in the order of tens of thousands. Then, we build the boolean mapping matrix \mathbf S \in \{0,1\}^{M \times N} such that [\mathbf S]_{i,j}=1 \Leftrightarrow w_i appears in x_j. \ \ \ \ \ \ \ \ \ \ (3) Importantly, since each tweet constitutes of 140 characters, with the English words being 4.5 characters long on average R. J. Pierce, ''An introduction to information theory: Symbols, signals and noise'', 2nd ed. New York, NY: Dover Publications, 1980., the columns of \mathbf S will be sparse. Next, we define the M \times M nonnegative word-correlation matrix \mathbf A_w \overset{\triangle}{=} \mathbf S \mathbf S^T\ \ \ \ \ \ \ \ \ \ (4) so that [\mathbf A_w]_{i,j} \leq \min_{i,j} \| [\mathbf S]_{i,:}^T \|_1 counts the number of tweets in \mathcal T in which w_i and w_j coexist. In a large-scale weighted graph interpratation, if each word in the dictionary is represented by a vertix and two vertices are connected with an edge weighted by the number of tweets in which these two words coexist, then \mathbf A_w is the graph's (weighted) adjacency matrix. Notice that the N \times N tweet-correlation matrix \mathbf A_t \overset{\triangle}{=} \mathbf S^T\mathbf S is such that [\mathbf A_t]_{i,j} counts the number of common words in t_i and t_j . Vanilla PCA Eigen-tweets A simple method to extract the principal eigen-tweet would be to solve the familiar quadratic-form maximization problem \mathcal P: ~~~~ \begin{array}{l} \underset{\mathbf s\in \mathbb{R}^{M \times M}}{\mathrm{maximize}} ~~\mathbf x^T \mathbf A \mathbf x \\ \mathrm{subject~ to} ~\| \mathbf x\|_2 =1 \end{array} \ \ \ \ \ \ \ \ \ \ (5) known as ``Vanilla PCA." Then, denoting by \mathcal I^* the set of the indices of the positive entries of \mathbf x^* , |\mathcal I^*| = \| \mathbf x^* \|_0 , with \| \cdot \|_0 denoting the l_0 -norm of a vector --counting the number of non-zero entries, we form the eigen-tweet t^* = \bigcup_{i \in \mathcal I^*} w_i. \ \ \ \ \ \ \ \ \ \ (6) The, for now intuitive, rationale behind this eigen-tweet construction is that [\mathbf x^*]_i will have a great value if is w_i popular and, therefore \|[\mathbf A]_{:,i}\|_1 large. Certainly, since there is no guarantee that \mathbf x^* is sparse, t^* may be too long and hard to interpret --hence, inappropriate for an eigen-tweet. This simple observation motivates the pursue of a '''sparse''' principal component of the adjacency matrix \mathbf A_w . Sparse PCA Eigen-tweets An alternative problem formulation for the derivation of the principal eigen-tweet, conforming with the risen sparsity demand, is \mathcal P_s: ~~~~ \begin{array}{l} \underset{\mathbf s\in \mathbb{R}^{M \times M}}{\mathrm{maximize}} ~~\mathbf x^T \mathbf A \mathbf x \\ \mathrm{subject~ to} ~\| \mathbf x\|_2 =1~~ \mathrm{ and }~~ \| \mathbf x\|_0 = k \end{array} \ \ \ \ \ \ \ \ \ \ (7) where k is the number of words comprising the eigen-tweet and \mathcal P_s is the well-known sparse principal-component analysis problem A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, ``A direct formulation of sparse PCA using semideﬁnite programming," ''SIAM Review'', 49(3), 2007., A. A. Amini and M. Wainwright, ``High-dimensional analysis of semideﬁnite relaxations for sparse principal components," ''The Annals of Statistics'', 37(5B):2877-2921, 2009., B. Moghaddam, Y. Weiss, and S. Avidan, ``Generalized spectral bounds for sparse lda," in ''Proceedings of the 23rd international conference on Machine learning'', pp. 641-648, 2006. , B. Moghaddam, Y. Weiss, and S. Avidan, ``Spectral bounds for sparse PCA: exact and greedy algorithms," ''Advances in Neural Information Processing Systems'', 2006, p. 18., A. d’Aspremont, F. Bach, and L. El Ghaoui ``Optimal solutions for sparse principal component analysis," ''Journal of Machine Learning Research'', 2008, pp. 1269-1294., H. Zou, T. Hastie, and R. Tibshirani, ``Sparse principal component analysis," ''Journal of Computational and Graphical Statistics'', 2006, pp. 265-286., H. Shen and J. Z. Huang , ``Sparse principal component analysis via regularized low rank matrix approximation," ''J. Multivar. Anal.'', July 2008, pp. 1015-1034.. The l_0 cardinality constraint limits the optimization over vectors with k non-zero entries and (7) is known to be [http://en.wikipedia.org/wiki/NP-hard NP-hard] and, thus, computationally intractable in general. In the rest of this work, we will present the state-of-the-art algorithmic developments and results in and show how the zeitgeist can be recorded through efficient sparse PCA on large tweet data-sets. References